Linear Support Vector Machine Course Introduction

Course History

NTU Version
• 15-17 weeks (2+ hours)
• highly-praised with English
  and blackboard teaching

Coursera Version
• 8 weeks of 'foundations'
  (previous course) + 8 weeks
  of 'techniques' (this course)
• Mandarin teaching to reach
  more audience in need
• slides teaching improved
  with Coursera's quiz and
  homework mechanisms

goal: try making Coursera version
even better than NTU version





Course Design

from Foundations to Techniques

• mixture of philosophical illustrations, key theory, core algorithms,
  usage in practice, and hopefully jokes :-)
• three major techniques surrounding feature transforms:
  • Embedding Numerous Features: how to exploit and regularize
    numerous features?
    -> inspires Support Vector Machine (SVM) model
  • Combining Predictive Features: how to construct and blend
    predictive features?
    -> inspires Adaptive Boosting (AdaBoost) model
  • Distilling Implicit Features: how to identify and learn implicit
    features?
    -> inspires Deep Learning model

allows students to use ML professionally
Fun Time

Which of the following descriptions of this course is true?

(1) the course will be taught in Taiwanese
(2) the course will tell me the techniques that create the android
    Lieutenant Commander Data in Star Trek
(3) the course will be 16 weeks long
(4) the course will focus on three major techniques

Reference Answer: (4)

(1) no, my Taiwanese is unfortunately not
    good enough for teaching (yet)
(2) no, although what we teach may serve as
    building blocks
(3) no, unless you have also joined the
    previous course
(4) yes, let's get started!
Roadmap

(1) Embedding Numerous Features: Kernel Models

    Lecture 1: Linear Support Vector Machine
    • Course Introduction
    • Large-Margin Separating Hyperplane
    • Standard Large-Margin Problem
    • Support Vector Machine
    • Reasons behind Large-Margin Hyperplane

(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models

Linear Support Vector Machine Large-Margin Separating Hyperplane

Linear Classification Revisited

PLA/pocket: $h(\mathbf{x}) = \text{sign}(s)$

[figure: perceptron diagram mapping the inputs to score $s$ and output $h(\mathbf{x})$]

(linear separable)

plausible err = 0/1
(small flipping noise)
minimize specially

linear (hyperplane) classifiers:
$h(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x})$
Which Line Is Best?

• PLA? depending on randomness
• VC bound? whichever you like!
  $E_{\text{out}}(\mathbf{w}) \le \underbrace{E_{\text{in}}(\mathbf{w})}_{0} + \underbrace{\Omega(\mathcal{H})}_{d_{\text{VC}} = d+1}$

You? rightmost one, possibly :-)
Why Rightmost Hyperplane?

informal argument

if (Gaussian-like) noise on future $\mathbf{x} \approx \mathbf{x}_n$:

$\mathbf{x}_n$ further from hyperplane
<=> tolerate more noise
<=> more robust to overfitting

distance to closest $\mathbf{x}_n$
<=> amount of noise tolerance
<=> robustness of hyperplane

rightmost one: more robust
because of larger distance to closest $\mathbf{x}_n$


Fat Hyperplane

[figure: the same × and ○ examples shown in three panels]

• robust separating hyperplane: fat
  (far from both sides of examples)
• robustness = fatness: distance to closest $\mathbf{x}_n$

goal: find fattest separating hyperplane
Large-Margin Separating Hyperplane

$\max_{\mathbf{w}} \ \text{fatness}(\mathbf{w})$
subject to  $\mathbf{w}$ classifies every $(\mathbf{x}_n, y_n)$ correctly
            $\text{fatness}(\mathbf{w}) = \min_{n=1,\ldots,N} \text{distance}(\mathbf{x}_n, \mathbf{w})$

• fatness: formally called margin
• correctness: $y_n = \text{sign}(\mathbf{w}^T \mathbf{x}_n)$

rewritten with these two terms:

$\max_{\mathbf{w}} \ \text{margin}(\mathbf{w})$
subject to  every $y_n \mathbf{w}^T \mathbf{x}_n > 0$
            $\text{margin}(\mathbf{w}) = \min_{n=1,\ldots,N} \text{distance}(\mathbf{x}_n, \mathbf{w})$

goal: find largest-margin
separating hyperplane


Fun Time

Consider two examples $(\mathbf{v}, +1)$ and $(-\mathbf{v}, -1)$ where $\mathbf{v} \in \mathbb{R}^2$ (without
padding the $v_0 = 1$). Which of the following hyperplanes is the
largest-margin separating one for the two examples? You are highly
encouraged to visualize by considering, for instance, $\mathbf{v} = (3, 2)$.

(1) $x_1 = 0$
(2) $x_2 = 0$
(3) $v_1 x_1 + v_2 x_2 = 0$
(4) $v_2 x_1 + v_1 x_2 = 0$

Reference Answer: (3)

Here the largest-margin separating hyperplane
(line) must be a perpendicular bisector of the
line segment between $\mathbf{v}$ and $-\mathbf{v}$. Hence $\mathbf{v}$ is a
normal vector of the largest-margin line. The
result can be extended to the more general
case of $\mathbf{v} \in \mathbb{R}^d$.
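Linear Support Vector Machine Standard Large-Margin Problem

Distance to Hyperplane: Preliminary

$\max_{\mathbf{w}} \ \text{margin}(\mathbf{w})$
subject to  every $y_n \mathbf{w}^T \mathbf{x}_n > 0$
            $\text{margin}(\mathbf{w}) = \min_{n=1,\ldots,N} \text{distance}(\mathbf{x}_n, \mathbf{w})$

'shorten' x and w
distance needs $w_0$ and $(w_1, \ldots, w_d)$ differently (to be derived),
so take $b = w_0$ out and let $\mathbf{w} = (w_1, \ldots, w_d)$, $\mathbf{x} = (x_1, \ldots, x_d)$

for this part: $h(\mathbf{x}) = \text{sign}(\mathbf{w}^T \mathbf{x} + b)$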
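Distance to Hyperplane

want: distance$(\mathbf{x}, b, \mathbf{w})$, with hyperplane $\mathbf{w}^T \mathbf{x}' + b = 0$

consider $\mathbf{x}'$, $\mathbf{x}''$ on hyperplane
(1) $\mathbf{w}^T \mathbf{x}' = -b$, $\mathbf{w}^T \mathbf{x}'' = -b$
(2) $\mathbf{w} \perp$ hyperplane:
    $\mathbf{w}^T (\mathbf{x}'' - \mathbf{x}') = 0$, where $(\mathbf{x}'' - \mathbf{x}')$ is a vector on the hyperplane
(3) distance = project $(\mathbf{x} - \mathbf{x}')$ to $\perp$ hyperplane

$\text{distance}(\mathbf{x}, b, \mathbf{w}) = \left| \frac{\mathbf{w}^T}{\|\mathbf{w}\|} (\mathbf{x} - \mathbf{x}') \right| \overset{(1)}{=} \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^T \mathbf{x} + b \right|$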
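
The distance formula can be checked numerically. The following is a minimal sketch in plain Python; the function name `distance` mirrors the slide's notation, and the hyperplane and point used in the example are made-up values, not taken from the lecture.

```python
import math

def distance(x, b, w):
    """Distance from point x to the hyperplane w^T x + b = 0."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return abs(score) / norm_w

# Illustrative values (assumed): hyperplane 3*x1 + 4*x2 - 5 = 0
w_demo, b_demo = [3.0, 4.0], -5.0
d_off = distance([3.0, 4.0], b_demo, w_demo)   # a point off the hyperplane
d_on = distance([3.0, -1.0], b_demo, w_demo)   # a point on the hyperplane
print(d_off, d_on)
```

A point that satisfies $\mathbf{w}^T \mathbf{x} + b = 0$ gets distance zero, which matches step (1) of the derivation.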
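Distance to Separating Hyperplane

$\text{distance}(\mathbf{x}, b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|} \left| \mathbf{w}^T \mathbf{x} + b \right|$

• separating hyperplane: for every $n$
  $y_n(\mathbf{w}^T \mathbf{x}_n + b) > 0$
• distance to separating hyperplane:
  $\text{distance}(\mathbf{x}_n, b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|} y_n(\mathbf{w}^T \mathbf{x}_n + b)$

$\max_{b, \mathbf{w}} \ \text{margin}(b, \mathbf{w})$
subject to  every $y_n(\mathbf{w}^T \mathbf{x}_n + b) > 0$
            $\text{margin}(b, \mathbf{w}) = \min_{n=1,\ldots,N} \frac{1}{\|\mathbf{w}\|} y_n(\mathbf{w}^T \mathbf{x}_n + b)$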
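
As a rough illustration of the margin definition, the sketch below evaluates $\min_n \frac{1}{\|\mathbf{w}\|} y_n(\mathbf{w}^T \mathbf{x}_n + b)$ in plain Python. The four toy points and the two candidate hyperplanes are assumptions chosen for illustration; a negative result signals that the candidate does not separate the data.

```python
import math

def margin(X, y, b, w):
    """min over n of y_n (w^T x_n + b) / ||w||.

    Positive only when (b, w) separates the data; then it equals the
    distance from the hyperplane to the closest example.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return min(
        yn * (sum(wi * xi for wi, xi in zip(w, xn)) + b) / norm_w
        for xn, yn in zip(X, y)
    )

# Toy data (assumed for illustration)
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]

m_good = margin(X, y, -1.0, (1.0, -1.0))  # a separating hyperplane
m_bad = margin(X, y, 0.0, (1.0, 1.0))     # misclassifies some points
print(m_good, m_bad)
```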





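Margin of Special Separating Hyperplane

$\max_{b, \mathbf{w}} \ \text{margin}(b, \mathbf{w})$
subject to  every $y_n(\mathbf{w}^T \mathbf{x}_n + b) > 0$
            $\text{margin}(b, \mathbf{w}) = \min_{n=1,\ldots,N} \frac{1}{\|\mathbf{w}\|} y_n(\mathbf{w}^T \mathbf{x}_n + b)$

• $\mathbf{w}^T \mathbf{x} + b = 0$ same as $3\mathbf{w}^T \mathbf{x} + 3b = 0$: scaling does not matter
• special scaling: only consider separating $(b, \mathbf{w})$ such that
  $\min_{n=1,\ldots,N} y_n(\mathbf{w}^T \mathbf{x}_n + b) = 1 \implies \text{margin}(b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|}$

$\max_{b, \mathbf{w}} \ \frac{1}{\|\mathbf{w}\|}$
subject to  every $y_n(\mathbf{w}^T \mathbf{x}_n + b) > 0$
            $\min_{n=1,\ldots,N} y_n(\mathbf{w}^T \mathbf{x}_n + b) = 1$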
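
The special scaling can be sketched in a few lines of plain Python: divide $(b, \mathbf{w})$ by the smallest score $y_n(\mathbf{w}^T \mathbf{x}_n + b)$, and the margin becomes $1/\|\mathbf{w}\|$ without moving the hyperplane. The helper name `rescale_to_unit_score` and the toy data are hypothetical, introduced only for this example.

```python
import math

def rescale_to_unit_score(X, y, b, w):
    """Scale a separating (b, w) so that min_n y_n (w^T x_n + b) = 1."""
    s = min(
        yn * (sum(wi * xi for wi, xi in zip(w, xn)) + b)
        for xn, yn in zip(X, y)
    )
    if s <= 0:
        raise ValueError("(b, w) does not separate the data")
    return b / s, [wi / s for wi in w]

# Toy data and an arbitrarily scaled separating hyperplane (assumed)
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
b0, w0 = -3.0, [3.0, -3.0]

b1, w1 = rescale_to_unit_score(X, y, b0, w0)
margin_after = 1.0 / math.sqrt(sum(wi * wi for wi in w1))
print(b1, w1, margin_after)
```

Both $(b_0, \mathbf{w}_0)$ and the rescaled pair describe the same line, so the geometric margin is unchanged; only its formula simplifies.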





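Standard Large-Margin Hyperplane Problem

$\max_{b, \mathbf{w}} \ \frac{1}{\|\mathbf{w}\|}$  subject to  $\min_{n=1,\ldots,N} y_n(\mathbf{w}^T \mathbf{x}_n + b) = 1$

necessary constraints: $y_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1$ for all $n$

original constraint: $\min_{n=1,\ldots,N} y_n(\mathbf{w}^T \mathbf{x}_n + b) = 1$
want: optimal $(b, \mathbf{w})$ here (inside)

if optimal $(b, \mathbf{w})$ outside, e.g. $y_n(\mathbf{w}^T \mathbf{x}_n + b) > 1.126$ for all $n$
-> can scale $(b, \mathbf{w})$ to "more optimal" $\left(\frac{b}{1.126}, \frac{\mathbf{w}}{1.126}\right)$ (contradiction!)

final change: max => min, remove $\sqrt{\ }$, add $\frac{1}{2}$

$\min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to  $y_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1$ for all $n$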
Fun Time

Consider three examples $(\mathbf{x}_1, +1)$, $(\mathbf{x}_2, +1)$, $(\mathbf{x}_3, -1)$, where
$\mathbf{x}_1 = (3, 0)$, $\mathbf{x}_2 = (0, 4)$, $\mathbf{x}_3 = (0, 0)$. In addition, consider a hyperplane
$x_1 + x_2 = 1$. Which of the following is not true?

(1) the hyperplane is a separating one for the three examples
(2) the distance from the hyperplane to $\mathbf{x}_1$ is 2
(3) the distance from the hyperplane to $\mathbf{x}_3$ is $\frac{1}{\sqrt{2}}$
(4) the example that is closest to the hyperplane is $\mathbf{x}_3$

Reference Answer: (2)

The distance from the hyperplane to $\mathbf{x}_1$ is
$\frac{1}{\sqrt{2}}(3 + 0 - 1) = \sqrt{2}$.
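Linear Support Vector Machine Support Vector Machine

Solving a Particular Standard Problem

$\min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to  $y_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1$ for all $n$

$X = \begin{bmatrix} 0 & 0 \\ 2 & 2 \\ 2 & 0 \\ 3 & 0 \end{bmatrix}$, $\mathbf{y} = \begin{bmatrix} -1 \\ -1 \\ +1 \\ +1 \end{bmatrix}$

(i)   $-b \ge 1$
(ii)  $-2w_1 - 2w_2 - b \ge 1$
(iii) $2w_1 + b \ge 1$
(iv)  $3w_1 + b \ge 1$

• (i) & (iii) => $w_1 \ge +1$
  (ii) & (iii) => $w_2 \le -1$
  => $\frac{1}{2} \mathbf{w}^T \mathbf{w} \ge 1$
• $(w_1 = 1, w_2 = -1, b = -1)$ at lower bound and satisfies (i) - (iv)

$g_{\text{SVM}}(\mathbf{x}) = \text{sign}(x_1 - x_2 - 1)$: SVM? :-)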
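
The hand-derived solution can be double-checked with a short plain-Python sketch. It plugs $(w_1 = 1, w_2 = -1, b = -1)$ into constraints (i) to (iv), evaluates the objective, and lists which constraints hold with equality. Variable names such as `slacks` and `tight` are illustrative choices, not notation from the slides.

```python
# Data and candidate solution from the worked example
X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
w, b = (1.0, -1.0), -1.0

def score(x):
    """w^T x + b for the candidate solution."""
    return w[0] * x[0] + w[1] * x[1] + b

# slack_n = y_n (w^T x_n + b) - 1; the constraint holds iff slack_n >= 0
slacks = [yn * score(xn) - 1 for xn, yn in zip(X, y)]
feasible = all(s >= 0 for s in slacks)

objective = 0.5 * (w[0] ** 2 + w[1] ** 2)                # (1/2) w^T w
tight = [n + 1 for n, s in enumerate(slacks) if s == 0]  # constraints met with equality
predictions = [1 if score(xn) > 0 else -1 for xn in X]

print(feasible, objective, tight, predictions)
```

The objective equals the lower bound of 1 derived from (i) to (iii), and constraints (i), (ii), (iii) are the ones met with equality.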





Support Vector Machine (SVM)

optimal solution: $(w_1 = 1, w_2 = -1, b = -1)$

$\text{margin}(b, \mathbf{w}) = \frac{1}{\|\mathbf{w}\|} = \frac{1}{\sqrt{2}}$

• examples on boundary: 'locates' fattest hyperplane
  other examples: not needed
• call boundary example support vector (candidate)

support vector machine (SVM):
learn fattest hyperplanes
(with help of support vectors)





Solving General SVM

$\min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to  $y_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1$ for all $n$

• not easy manually, of course :-)
• gradient descent? not easy with constraints
• luckily:
  • (convex) quadratic objective function of $(b, \mathbf{w})$
  • linear constraints of $(b, \mathbf{w})$
  -> quadratic programming

quadratic programming (QP):
'easy' optimization problem


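Quadratic Programming

optimal $(b, \mathbf{w})$ = ?

$\min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to  $y_n(\mathbf{w}^T \mathbf{x}_n + b) \ge 1$,
            for $n = 1, 2, \ldots, N$

optimal $\mathbf{u} \leftarrow \text{QP}(Q, \mathbf{p}, A, \mathbf{c})$

$\min_{\mathbf{u}} \ \frac{1}{2} \mathbf{u}^T Q \mathbf{u} + \mathbf{p}^T \mathbf{u}$
subject to  $\mathbf{a}_m^T \mathbf{u} \ge c_m$,
            for $m = 1, 2, \ldots, M$

SVM with general QP solver:
easy if you've read the manual :-)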


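SVM with QP Solver

Linear Hard-Margin SVM Algorithm

(1) $Q = \begin{bmatrix} 0 & \mathbf{0}_d^T \\ \mathbf{0}_d & I_d \end{bmatrix}$; $\mathbf{p} = \mathbf{0}_{d+1}$; $\mathbf{a}_n^T = y_n \begin{bmatrix} 1 & \mathbf{x}_n^T \end{bmatrix}$; $c_n = 1$
(2) $\begin{bmatrix} b \\ \mathbf{w} \end{bmatrix} \leftarrow \text{QP}(Q, \mathbf{p}, A, \mathbf{c})$
(3) return $b$ & $\mathbf{w}$ as $g_{\text{SVM}}$

• hard-margin: nothing violates 'fat boundary'
• linear: $\mathbf{x}_n$

want non-linear?
$\mathbf{z}_n = \Phi(\mathbf{x}_n)$: remember? :-)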
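
Step (1) of the algorithm can be sketched in plain Python as below. This only builds $Q$, $\mathbf{p}$, $A$, $\mathbf{c}$ with $\mathbf{u} = (b, w_1, \ldots, w_d)$ and checks a known solution against them; it does not call any particular QP package, and the helper names `build_qp_inputs`, `qp_objective`, and `qp_feasible` are hypothetical. The data is the four-point toy set with two negative and two positive examples.

```python
def build_qp_inputs(X, y):
    """Return (Q, p, A, c) for hard-margin SVM with u = [b, w_1, ..., w_d]."""
    d = len(X[0])
    # Q is the identity on the w-part and zero on the b-part
    Q = [[0.0] * (d + 1) for _ in range(d + 1)]
    for i in range(1, d + 1):
        Q[i][i] = 1.0
    p = [0.0] * (d + 1)
    # row n of A is a_n^T = y_n [1, x_n^T]
    A = [[float(yn)] + [float(yn * xi) for xi in xn] for xn, yn in zip(X, y)]
    c = [1.0] * len(X)
    return Q, p, A, c

def qp_objective(Q, p, u):
    """(1/2) u^T Q u + p^T u."""
    quad = sum(u[i] * Q[i][j] * u[j] for i in range(len(u)) for j in range(len(u)))
    return 0.5 * quad + sum(pi * ui for pi, ui in zip(p, u))

def qp_feasible(A, c, u):
    """True when a_m^T u >= c_m for every row m."""
    return all(sum(a * ui for a, ui in zip(row, u)) >= cm for row, cm in zip(A, c))

X = [(0, 0), (2, 2), (2, 0), (3, 0)]
y = [-1, -1, +1, +1]
Q, p, A, c = build_qp_inputs(X, y)

u_star = [-1.0, 1.0, -1.0]  # (b, w_1, w_2) found by hand for this data
print(A, qp_objective(Q, p, u_star), qp_feasible(A, c, u_star))
```

In practice the four outputs would be handed to whichever QP solver is available, after converting to that solver's own sign and inequality conventions.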





Fun Time

Consider two negative examples with $\mathbf{x}_1 = (0, 0)$ and $\mathbf{x}_2 = (2, 2)$; two
positive examples with $\mathbf{x}_3 = (2, 0)$ and $\mathbf{x}_4 = (3, 0)$, as shown on page
17 of the slides. Define $\mathbf{u}$, $Q$, $\mathbf{p}$, $c_n$ as those listed on page 20 of the
slides. What are $\mathbf{a}_n^T$ that need to be fed into the QP solver?

(1) $\mathbf{a}_1^T = [-1, 0, 0]$, $\mathbf{a}_2^T = [-1, 2, 2]$, $\mathbf{a}_3^T = [-1, 2, 0]$, $\mathbf{a}_4^T = [-1, 3, 0]$
(2) $\mathbf{a}_1^T = [1, 0, 0]$, $\mathbf{a}_2^T = [1, -2, -2]$, $\mathbf{a}_3^T = [-1, 2, 0]$, $\mathbf{a}_4^T = [-1, 3, 0]$
(3) $\mathbf{a}_1^T = [1, 0, 0]$, $\mathbf{a}_2^T = [1, 2, 2]$, $\mathbf{a}_3^T = [1, 2, 0]$, $\mathbf{a}_4^T = [1, 3, 0]$
(4) $\mathbf{a}_1^T = [-1, 0, 0]$, $\mathbf{a}_2^T = [-1, -2, -2]$, $\mathbf{a}_3^T = [1, 2, 0]$, $\mathbf{a}_4^T = [1, 3, 0]$

Reference Answer: (4)

We need $\mathbf{a}_n^T = y_n \begin{bmatrix} 1 & \mathbf{x}_n^T \end{bmatrix}$.
Linear Support Vector Machine Reasons behind Large-Margin Hyperplane

Why Large-Margin Hyperplane?

$\min_{b, \mathbf{w}} \ \frac{1}{2} \mathbf{w}^T \mathbf{w}$
subject to  $y_n(\mathbf{w}^T \mathbf{z}_n + b) \ge 1$ for all $n$

                  minimize                       constraint
regularization    $E_{\text{in}}$                $\mathbf{w}^T \mathbf{w} \le C$
SVM               $\mathbf{w}^T \mathbf{w}$      $E_{\text{in}} = 0$ [and more]

SVM (large-margin hyperplane):
'weight-decay regularization' within $E_{\text{in}} = 0$





Large-Margin Restricts Dichotomies

consider 'large-margin algorithm' $\mathcal{A}_\rho$:
either returns $g$ with margin$(g) \ge \rho$ (if exists), or 0 otherwise

$\mathcal{A}_0$: like PLA => shatter 'general' 3 inputs

$\mathcal{A}_{1.126}$: more strict than SVM => cannot shatter any 3 inputs

fewer dichotomies => smaller 'VC dim.' => better generalization
VC Dimension of Large-Margin Algorithm

fewer dichotomies => smaller 'VC dim.'
considers $d_{\text{VC}}(\mathcal{A}_\rho)$ [data-dependent, need more than VC]
instead of $d_{\text{VC}}(\mathcal{H})$ [data-independent, covered by VC]

$d_{\text{VC}}(\mathcal{A}_\rho)$ when $\mathcal{X}$ = unit circle in $\mathbb{R}^2$

• $\rho = 0$: just perceptrons ($d_{\text{VC}} = 3$)
• $\rho > \frac{\sqrt{3}}{2}$: cannot shatter any 3 inputs
  ($d_{\text{VC}} < 3$)
  (some inputs must be of distance $\le \sqrt{3}$)

generally, when $\mathcal{X}$ in radius-$R$ hyperball:

$d_{\text{VC}}(\mathcal{A}_\rho) \le \min\left(\frac{R^2}{\rho^2}, d\right) + 1 \le \underbrace{d + 1}_{d_{\text{VC}}(\text{perceptrons})}$
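
The general bound is easy to evaluate. Below is a minimal plain-Python sketch; the function name `large_margin_vc_bound` and the sample values of $R$, $\rho$, $d$ are assumptions for illustration, and $\rho = 0$ is treated as the plain perceptron case $d + 1$.

```python
def large_margin_vc_bound(R, rho, d):
    """Upper bound min(R^2 / rho^2, d) + 1 on the VC dimension of A_rho.

    R: radius of the hyperball containing the inputs
    rho: required margin (rho = 0 falls back to perceptrons, d + 1)
    d: input dimension (excluding the constant coordinate)
    """
    if rho <= 0:
        return d + 1
    return min((R * R) / (rho * rho), d) + 1

# Illustrative values (assumed)
print(large_margin_vc_bound(1.0, 0.0, 2))   # no margin requirement
print(large_margin_vc_bound(1.0, 0.5, 10))  # margin term is the smaller one
print(large_margin_vc_bound(1.0, 0.9, 2))   # rho above sqrt(3)/2 on the unit circle
```

When $R^2/\rho^2$ is smaller than $d$, the bound no longer depends on the dimension, which is the point of requiring a large margin.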
Benefits of Large-Margin Hyperplanes

            large-margin     hyperplanes     hyperplanes
            hyperplanes                      + feature transform $\Phi$
#           even fewer       not many        many
boundary    simple           simple          sophisticated

• not many: good, for $d_{\text{VC}}$ and generalization
• sophisticated: good, for possibly better $E_{\text{in}}$

a new possibility: non-linear SVM

            large-margin hyperplanes
            + numerous feature transform $\Phi$
#           not many
boundary    sophisticated
Fun Time

Consider running the 'large-margin algorithm' $\mathcal{A}_\rho$ with $\rho = \frac{1}{4}$ on a
Z-space such that $\mathbf{z} = \Phi(\mathbf{x})$ is of 1126 dimensions (excluding $z_0$) and
$\|\mathbf{z}\| \le 1$. What is the upper bound of $d_{\text{VC}}(\mathcal{A}_\rho)$ when calculated by
$\min\left(\frac{R^2}{\rho^2}, d\right) + 1$?

(1) 5
(2) 17
(3) 1126
(4) 1127

Reference Answer: (2)

By the description, $d = 1126$ and $R = 1$. So
the upper bound is simply 17.
Summary

(1) Embedding Numerous Features: Kernel Models

    Lecture 1: Linear Support Vector Machine
    • Course Introduction
      from foundations to techniques
    • Large-Margin Separating Hyperplane
      intuitively more robust against noise
    • Standard Large-Margin Problem
      minimize 'length of w' at special separating scale
    • Support Vector Machine
      'easy' via quadratic programming
    • Reasons behind Large-Margin Hyperplane
      fewer dichotomies and better generalization

    • next: solving non-linear Support Vector Machine

(2) Combining Predictive Features: Aggregation Models
(3) Distilling Implicit Features: Extraction Models